Categorical data
Our first steps in working and modeling categorical data with R
Welcome
Introduction!
Welcome to our tenth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.
During this week’s lecture you were introduced to categorical data and different approaches to model it.
In this lab session we will:
- Introduce a logic of working with categorical variables with
{forcats} - How to perform binary logistic regression in
R
- How to interpret and visualize results with
{marginaleffects}
Logistic Regression in R
Regression is a key tool in statistics for exploring relationships between variables. Some of the two most common types are linear and logistic regression. In most cases, we use:
- Linear regression when the dependent variable is quantitative.
- Logistic regression when the dependent variable is qualitative.
Types of regression (so far)
Linear regression
- Simple linear regression: one independent variable, continuous dependent variable.
- Multiple linear regression: two or more independent variables, continuous dependent variable.
Logistic regression
- Binary logistic regression: binary dependent variable (two outcomes).
- Multinomial logistic regression: nominal dependent variable with 3+ unordered categories.
- Ordinal logistic regression: ordinal dependent variable with 3+ ordered categories.
Logistic and probit regressions,
which you saw during this week’s lecture, are both part of
Generalized Linear Models (GLM), which extend linear
models to handle non-continuous outcomes and non-normal residuals. In
the labs we will focus on logistic regression, but the general
syntax to set up the models and logic of extracting predictive margins
with {marginaleffects} can be applied to probit
models.
Let’s turn to the task at hand
The Ministry has asked you to analyze these data from a nationally representative survey of 20,000 Burovian residents, focusing on the question:
Do you support the construction of wind turbines within 5 kilometers of residential areas?
The ministry is particularly interested in:
- Whether opposition is driven more by ideology, self-interest, or misinformation
- How support varies based on income, proximity, environmental beliefs, and trust in government
- Whether seemingly “pro-environmental” individuals are still susceptible to NIMBY tendencies
Your findings will inform public communication strategy, siting policies, and the design of community engagement programs.
Logistic regression with one independent variable
Say we want to examine the relationship between income and the likelihood of supporting wind turbine projects.
A binary logistic regression is appropriate when the dependent variable is categorical with two levels (e.g., support vs. no support), and the independent variable(s) can be either categorical or continuous.
In this case, the dependent variable is whether a resident supports
the wind turbines (coded as 1 for support and
0 for no support), and the independent variable is
income (a continuous variable).
In R, logistic regression can be done using the glm()
function, specifying with the argument family = "binomial".
The syntax is similar to other formulas we have used in the past:
Let’s build our model
income_model <-
glm(support_wind_turbines ~ income_1000,
family = binomial,
data = wind_factors_df
)
modelsummary::modelsummary(income_model,
stars = c('*' = .1, '**' = .05, '***' = .01),
statistic = 'conf.int')| (1) | |
|---|---|
| * p < 0.1, ** p < 0.05, *** p < 0.01 | |
| (Intercept) | -0.759*** |
| [-0.836, -0.683] | |
| income_1000 | 0.039*** |
| [0.038, 0.041] | |
| Num.Obs. | 20000 |
| AIC | 18986.1 |
| BIC | 19001.9 |
| Log.Lik. | -9491.031 |
| F | 2415.373 |
| RMSE | 0.39 |
How to interpret our coefficients? 🤔
The (Intercept) (\(β_0\)) is the
constant term, and the income_1000 coefficient (\(β_1\)) shows the expected changes of income
on the likelihood of supporting wind turbines.
- The estimate for
income_1000: This coefficient indicates how much the log-odds of supporting wind turbines changes with a one-unit increase in income (in thousands).
Interpreting log-odds is not the most intuitive. A general rule to keep in mind is:
- when \(β_1 = 0\), \(X\) and \(Y\) are independent,
- when \(β_1 > 0\), the probability that \(Y=1\) increases with \(X\),
- when \(β_1 < 0\), the probability that \(Y=1\) decreases with \(X\),
When \(β_1\) is positive, an increase in income increases the probability of supporting wind turbines.
Turning it into odds ratios: We can also shift to odds ratios to quantify this relationship and interpret it as follows:
\[exp(β_1) = exp(0.039) = 1.0397\]
Based on this result, we can say that an extra 1,000 dollars of income increases the odds (that is, the chance) of supporting by a factor of ≈1.04.
In R:
## [1] 1.03977
Predicting from our models
Estimating the relationship between variables is one reason for
building models. Another goal is to predict the dependent variable based
on newly observed values of the independent variable(s). This can be
done with the predict() function.
Suppose we would like to predict the probability of supporting building wind turbines within 5km of residential areas for an individuals with $70000 of income:
## 1
## 0.8805791
We would predict that a Burovian with $70000 of income would have an 88% change of supporting the building of wind turbines within 5km of residential areas.
We can also construct a confidence interval for this prediction, it
can be done by adding the se = TRUE argument in the
predict() function:
pred_income <- predict(income_model,
newdata = data.frame(income_1000 = 70),
type = "response",
se = T
)
pred_income$fit## 1
## 0.8805791
## creating CIs
lower <- pred_income$fit - (qnorm(0.975) * pred_income$se.fit)
upper <- pred_income$fit + (qnorm(0.975) * pred_income$se.fit)
c(lower, upper)## 1 1
## 0.8747801 0.8863782
Visualizing the results of the model
We can also use the {marginaleffects} package to provide
a visual depiction of our model. We can use the function
marginaleffects::plot_predictions() to plot our conditional
predictions.
It is wrapped around {ggplot2}, so we can use some of
the syntax we learned in the past to go beyond the defaults. We can also
show where our prediction for $70K lies:
marginaleffects::plot_predictions(income_model, condition = "income_1000") +
geom_point(x = 70, y = 0.8805791, color = "#cc0065", size = 2) +
geom_vline(xintercept = 70, alpha = 0.5, color = "#cc0065", linetype = "dashed") + # this is our prediction for 70
theme_minimal() +
labs(x = "Income (thousands)",
y = "Predicted probability of support\nfor wind turbines")Logistic regression with multiple independent variables
The interpretation of the coefficients in multivariable logistic regression is similar to the interpretation in univariable regression, except that this time it estimates the multiplicative change in the odds in favor of \(Y = 1\) when \(X\) increases by 1 unit, while the other independent variables remain unchanged.
For this illustration, suppose we would like to estimate the relationship between supporting wind turbine projects and all variables present in the data frame:
multivariate_model <- glm(support_wind_turbines ~ proximity_to_proposed + env_concern +
trust_in_government + left_right_ideology + age + income_1000 + educ,
family = binomial,
data = wind_factors_df)
modelsummary::modelsummary(multivariate_model,
stars = c('*' = .1, '**' = .05, '***' = .01),
statistic = 'conf.int')| (1) | |
|---|---|
| * p < 0.1, ** p < 0.05, *** p < 0.01 | |
| (Intercept) | -0.862*** |
| [-1.024, -0.700] | |
| proximity_to_proposed1 | -0.783*** |
| [-0.866, -0.699] | |
| env_concernMedium | 1.198*** |
| [1.106, 1.290] | |
| env_concernHigh | 2.326*** |
| [2.212, 2.441] | |
| trust_in_government1 | 0.526*** |
| [0.447, 0.605] | |
| left_right_ideology1 | -0.652*** |
| [-0.731, -0.572] | |
| age | -0.021*** |
| [-0.023, -0.019] | |
| income_1000 | 0.049*** |
| [0.047, 0.051] | |
| educUniversity degree | 0.291*** |
| [0.204, 0.377] | |
| educPostgraduate degree | 0.635*** |
| [0.524, 0.747] | |
| Num.Obs. | 20000 |
| AIC | 15689.4 |
| BIC | 15768.4 |
| Log.Lik. | -7834.675 |
| F | 421.034 |
| RMSE | 0.35 |
Like previous example with one predictor, it is easier to interpret these relationships through OR. But this time, we also print the 95% CI of the OR in addition to the OR (rounded to 3 decimals) so that we can easily see which ones are significantly different from 1:
round(exp(cbind(OR = coef(multivariate_model), confint(multivariate_model))), 3) %>%
knitr::kable() %>%
kableExtra::kable_styling()| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | 0.422 | 0.359 | 0.496 |
| proximity_to_proposed1 | 0.457 | 0.420 | 0.497 |
| env_concernMedium | 3.313 | 3.023 | 3.632 |
| env_concernHigh | 10.238 | 9.138 | 11.490 |
| trust_in_government1 | 1.692 | 1.563 | 1.831 |
| left_right_ideology1 | 0.521 | 0.481 | 0.564 |
| age | 0.979 | 0.977 | 0.981 |
| income_1000 | 1.051 | 1.049 | 1.052 |
| educUniversity degree | 1.337 | 1.227 | 1.459 |
| educPostgraduate degree | 1.887 | 1.689 | 2.110 |
From the ORs and their 95% CIs computed above, we conclude that:
Proximity to the proposed site: individuals who live close to the proposed site have odds of supporting the project that are 0.46 times those of individuals who do not live nearby, all else being equal. This indicates a negative association between proximity and support.
Environmental concern: individuals with medium environmental concern have odds of support that are 3.31 times those with low concern, while individuals with high environmental concern have odds 10.24 times higher than those with low concern, all else being equal. These strong and significant associations suggest that higher environmental concern is positively associated with support for the project.
Trust in government: individuals who trust the government have odds of supporting the project that are 1.69 times higher than those who do not, all else being equal.
Left-right ideology: individuals who lean right on the ideological spectrum have odds of support that are 0.52 times those who lean left, indicating a negative relationship between right-leaning ideology and support for the project.
Age: for each additional year of age, the odds of supporting the project are multiplied by 0.98, suggesting that older individuals are slightly less likely to support the project, all else being equal.
Income: for each additional 1,000 Burovian dollars of income, the odds of supporting the project increase by a factor of 1.05. This positive and significant association indicates that wealthier individuals are more likely to support the project.
Education: individuals with a university degree have odds of support that are 1.34 times those of individuals with a lower level of education, while those with a postgraduate degree have odds that are 1.89 times higher, all else being equal.
Intercept: we refrain from interpreting the intercept, as it represents the odds of support for a hypothetical individual with all predictors set to zero or to the reference category, which is not substantively meaningful in this context.
Interactions
In the previous sections, we omitted potential interaction effects.
As we learned last week, an interaction occurs when the relationship between an independent variable and the dependent variable depends on the value taken by another independent variable. In other words, if the effect of one predictor on the outcome changes depending on the level of another predictor, then there is evidence of an interaction.
Based on what we know from Burovia, we hypothesize that our expectations of proximity to a proposed development on support for the development may vary depending on a respondent’s level of environmental concern.
Why might environmental concern and proximity interact?
From the perspective of environmental sociology and the literature on Not In My Back Yard (NIMBY) dynamics—something that the Ministry itself seems to suspect might influence public reactions—prior research suggests that people’s support or opposition to local developments is shaped by both their broader values and worldviews, such as environmental concern, and their geographic proximity to the proposed site.
While environmental concern might lead people to support renewable energy in principle, this support can become conditional when the development is proposed near their homes. This is a classic NIMBY scenario: a person may be in favor of wind energy as a policy, but not want turbines visible from their backyard.
Those high in environmental concern may be more likely to support wind energy generally. But when the project is close to them, they may become skeptical—worrying about noise, visual impacts, or ecological disruption to their immediate surroundings. Their support is principled, but may falter when the costs feel personal.
Those low in environmental concern may be less engaged either way. Proximity might not change their views significantly, or they may focus more on property rights, aesthetics, or distrust of government projects regardless of environmental reasoning.
This leads to a plausible interaction hypothesis:
Our expectations for the changes in support for the construction of wind turbines within 5km of resiential areas by proximity to a proposed development depends on a person’s level of environmental concern.
interaction_model <- glm(support_wind_turbines ~ proximity_to_proposed * env_concern +
trust_in_government + left_right_ideology + age + income_1000 + educ,
family = binomial,
data = wind_factors_df)
modelsummary::modelsummary(interaction_model,
stars = c('*' = .1, '**' = .05, '***' = .01),
statistic = 'conf.int')| (1) | |
|---|---|
| * p < 0.1, ** p < 0.05, *** p < 0.01 | |
| (Intercept) | -0.857*** |
| [-1.021, -0.694] | |
| proximity_to_proposed1 | -0.794*** |
| [-0.910, -0.677] | |
| env_concernMedium | 1.178*** |
| [1.067, 1.290] | |
| env_concernHigh | 2.341*** |
| [2.200, 2.485] | |
| trust_in_government1 | 0.525*** |
| [0.447, 0.605] | |
| left_right_ideology1 | -0.652*** |
| [-0.732, -0.572] | |
| age | -0.021*** |
| [-0.023, -0.019] | |
| income_1000 | 0.049*** |
| [0.047, 0.051] | |
| educUniversity degree | 0.291*** |
| [0.204, 0.378] | |
| educPostgraduate degree | 0.635*** |
| [0.524, 0.746] | |
| proximity_to_proposed1 × env_concernMedium | 0.056 |
| [-0.132, 0.245] | |
| proximity_to_proposed1 × env_concernHigh | -0.036 |
| [-0.263, 0.191] | |
| Num.Obs. | 20000 |
| AIC | 15692.7 |
| BIC | 15787.6 |
| Log.Lik. | -7834.370 |
| F | 344.348 |
| RMSE | 0.35 |
Getting predictive margins
##
## env_concern proximity_to_proposed Estimate Std. Error z Pr(>|z|) S 2.5 %
## Low 0 0.667 0.00535 124.6 <0.001 Inf 0.656
## Low 1 0.529 0.00860 61.6 <0.001 Inf 0.513
## Medium 0 0.823 0.00525 156.8 <0.001 Inf 0.812
## Medium 1 0.734 0.00892 82.2 <0.001 Inf 0.716
## High 0 0.926 0.00381 243.0 <0.001 Inf 0.918
## High 1 0.866 0.00735 117.8 <0.001 Inf 0.852
## 97.5 %
## 0.677
## 0.546
## 0.833
## 0.751
## 0.933
## 0.880
##
## Type: response
## Columns: env_concern, proximity_to_proposed, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
marginaleffects::plot_predictions(interaction_model, by = c("env_concern", "proximity_to_proposed")) +
labs(x = "Environmental concern level",
y = "Predicted probability of support\nfor wind turbines",
color = "Proximity to proposed site") +
scale_color_manual(labels = c("0" = "Does not live nearby",
"1" = "Lives nearby"),
values = c("#5C9184","#550055")) +
theme_minimal() +
theme(legend.position = "bottom")Drafting some brief recommendations
After analyzing the data from the national survey, you report back to the Ministry. You find that, on average, support for the wind turbine project is moderately high—but not uniform. Notably, support tends to decline among individuals who live within 5 kilometers of a proposed site, pointing to a classic NIMBY effect even for seemingly “pro-environmental” individuals.
You explain that most Burovians may support renewable energy in principle, but become more skeptical when the costs feel personal. This finding suggests that even pro-environmental values may not be sufficient to ensure local support in the absence of targeted engagement.
Acknowledgements
This tutorial is inspired by some sections of Antoine Soetewey’s Binary logistic regression in R. You should also take a look at the tutorial if interested on prediction and assessing model quality for that purpose.
This script was drafted by Carolina Díaz Suárez and Sebastian Ramirez-Ruiz, with contributions by Sofía García-Durrer, and William Fernandez.